Succinct Sampling on Streams

Authors

  • Vladimir Braverman
  • Rafail Ostrovsky
  • Carlo Zaniolo
Abstract

In the streaming model, data items arrive over a long period of time, either one item at a time or in bursts. Typical tasks include computing various statistics over a sliding window of some fixed time horizon. What makes the streaming model interesting is that, as time progresses, old items expire and new ones arrive. One of the simplest and most central tasks in this model is sampling, that is, the task of maintaining up to k uniformly distributed items from the current time window. We call sampling algorithms succinct if they use provably optimal (up to constant factors) worst-case memory to maintain k items (either with or without replacement). We stress that in many applications, structures that have an expected succinct representation as time progresses are not sufficient: expected small-memory solutions in a streaming environment will never provide a fixed bounded memory guarantee over the lifetime of a (very large) stream, as small-probability events eventually happen with probability 1. Thus, in this paper we ask the following question: are Succinct Sampling on Streams algorithms (or S^3 algorithms) possible, and if so, for what models? Perhaps somewhat surprisingly, we show that S^3 algorithms (i.e., with matching upper and lower bounds and worst-case fixed memory guarantees) are possible for all variants of the problem mentioned above, i.e., both with and without replacement and for both the one-at-a-time and bursty arrival models. In addition to fixed memory guarantees, our solution has further benefits that are important in applications. In the one-item-at-a-time model, the samples produced over non-overlapping windows are completely independent of each other, which was not the case for previous solutions. In the bursty model, previous solutions required floating-point computations; ours does not.
Finally, we use S^3 algorithms to solve various problems in the sliding-window model, including frequency moments, counting triangles, entropy, and density estimation. For these problems we present the first solutions with provable worst-case memory guarantees. Our results are based on a novel sampling method that could be of independent interest.
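
To make the sampling task concrete, the following is a minimal sketch of one classical way to keep a single uniform sample over a window of the most recent items, often called priority sampling. It is not the method of this paper: its memory is small only in expectation and can grow to the full window size in the worst case, which is exactly the kind of guarantee the abstract argues is insufficient. The class name PrioritySampler and its parameters are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


class PrioritySampler:
    """Keep one uniform sample from the last `window` items of a stream.

    Illustrative priority-sampling sketch (an assumption for this example,
    not the algorithm from the paper): memory is small in expectation only.
    """

    def __init__(self, window, rng=None):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.rng = rng or random.Random()
        self.count = 0        # number of items seen so far
        self.cands = deque()  # (index, priority, item); priorities decrease front to back

    def add(self, item):
        idx = self.count
        self.count += 1
        prio = self.rng.random()
        # An older candidate with a lower priority can never again be the
        # highest-priority item in the window, so it can be discarded.
        while self.cands and self.cands[-1][1] <= prio:
            self.cands.pop()
        self.cands.append((idx, prio, item))
        # Expire candidates at the front that have left the window.
        while self.cands[0][0] <= idx - self.window:
            self.cands.popleft()

    def sample(self):
        """Return the item with the highest priority in the current window."""
        if not self.cands:
            raise LookupError("no items seen yet")
        return self.cands[0][2]

    def memory(self):
        """Number of stored candidates (at most `window`)."""
        return len(self.cands)


if __name__ == "__main__":
    sampler = PrioritySampler(window=100)
    for value in range(10_000):
        sampler.add(value)
    print("sample:", sampler.sample(), "stored candidates:", sampler.memory())
```

Because every item in the window is equally likely to hold the highest priority, the returned item is uniform over the window. Running k independent copies would give k samples with replacement, again under the same expected-memory caveat.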

Related resources

Succinct Representations of Ordinal Trees

We survey succinct representations of ordinal, or rooted, ordered trees. Succinct representations use space that is close to the appropriate information-theoretic minimum, but support operations on the tree rapidly, usually in O(1) time.

BSBC: Towards a Succinct Data Format for XML Streams

XML data compression is an important feature in XML data exchange, particularly when the data size may cause bottlenecks or when bandwidth and energy consumption limitations require reducing the amount of the exchanged XML data. However, applications based on XML data streams also require efficient path query processing on the structure of compressed XML data streams. We present a succinct repr...

Index Support for Mining Data Streams in a Relational DBMS

This paper presents a novel index, called I-Forest, to support data mining activities on data streams, i.e., sequences of incoming data blocks. This approach is appropriate for itemset extraction on evolving datasets such as analysis of transactional data streams from retail chains. The index is a covering structure that represents transaction blocks in a succinct form and allows different kind...

Array Range Queries

Array range queries are of current interest in the field of data structures. Given an array of numbers or arbitrary elements, the general array range query problem is to build a data structure that can efficiently answer queries of a given type stated in terms of an interval of the indices. The specific query type might be for the minimum element in the range, the most frequently occurring elem...

Evaluation of an alternate method for sampling benthic macroinvertebrates in low-gradient streams sampled as part of the National Rivers and Streams Assessment

Benthic macroinvertebrates are sampled in streams and rivers as one of the assessment elements of the US Environmental Protection Agency's National Rivers and Streams Assessment. In a 2006 report, the recommendation was made that different yet comparable methods be evaluated for different types of streams (e.g., low gradient vs. high gradient). Consequently, a research element was added to the ...


Journal:
  • CoRR

Volume: abs/cs/0702151

Publication date: 2007